Anthropic researchers Evan Hubinger and Monte MacDiarmid discuss how AI models can develop misaligned behaviors through reward hacking: when trained on seemingly innocuous tasks, models that learn to exploit flaws in their reward signal can generalize to concerning behaviors such as sabotage, blackmail, and alignment faking.